Abstract



We study sparse approximate solutions to convex optimization problems.
In many engineering applications researchers are interested in an
approximate solution of an optimization problem in the form of a linear
combination of elements from a given system of elements. There is
increasing interest in building such sparse approximate solutions using
different greedy-type algorithms. The problem of approximating a given
element of a Banach space by linear combinations of elements from a given
system (dictionary) is well studied in nonlinear approximation theory. At
first glance the settings of approximation and optimization problems are
very different. In the approximation problem an element is given and our
task is to find a sparse approximation of it. In optimization theory an
energy function is given and we should find an approximate sparse solution
to the minimization problem. It turns out that the same technique can be
used for solving both problems. We show how the technique developed in
nonlinear approximation theory, in particular the greedy approximation
technique, can be adjusted for finding a sparse solution of an
optimization problem.



1 Introduction 



We study sparse approximate solutions to convex optimization problems. We
apply the technique developed in nonlinear approximation known under the
name of greedy approximation. A typical problem of convex optimization is
to find an approximate solution to the problem

\[ \inf_{x} E(x) \tag{1.1} \]

under the assumption that $E$ is a convex function. Usually, in convex
optimization the function $E$ is defined on a finite dimensional space
$\mathbb{R}^n$ (see [3], [10]). Recent needs of numerical analysis call
for consideration of the above optimization problem on an infinite
dimensional space, for instance, a space of continuous functions. One more
important argument that motivates us to study this problem in the infinite
dimensional setting is the following. In many contemporary numerical
applications the dimension $n$ of the ambient space $\mathbb{R}^n$ is large
and we would like to obtain bounds on the convergence rate independent of
the dimension $n$. Our results for infinite dimensional spaces provide such
bounds on the convergence rate. Thus, we consider a convex function $E$
defined on a Banach space $X$. It is pointed out in [21] that in many
engineering applications researchers are interested in an approximate
solution of problem (1.1) as a linear combination of elements from a given
system $\mathcal{D}$ of elements. There is an increasing interest in
building such sparse approximate solutions using different greedy-type
algorithms (see, for instance, [21], [12], [5], and [20]). The problem of
approximation of a given element $f \in X$ by linear combinations of
elements from $\mathcal{D}$ is well studied in nonlinear approximation
theory (see, for instance, [6], [16], [17]). In order to address the
contemporary needs of approximation theory and computational mathematics,
a very general model of approximation with regard to a redundant system
(dictionary) has been considered in many recent papers. As such a model,
we choose a Banach space $X$ with elements as target functions and, as an
approximating system, an arbitrary system $\mathcal{D}$ of elements of this
space such that the closure of $\operatorname{span}\mathcal{D}$ coincides
with $X$.

The fundamental question is how to construct good methods (algorithms)
of approximation. Recent results have established that greedy-type
algorithms are suitable methods of nonlinear approximation, both in sparse
approximation with regard to bases and in sparse approximation with regard
to redundant systems. It turns out that there is one fundamental principle
that allows us to build good algorithms both for arbitrary redundant
systems and for very simple well-structured bases like the Haar basis. This
principle is the use of a greedy step in searching for a new element to be
added to a given sparse approximant. By a greedy step we mean one which
maximizes a certain functional determined by information from the previous
steps of the algorithm. We obtain different types of greedy algorithms by
varying the above-mentioned functional and also by using different ways of
constructing (choosing coefficients of the linear combination) the $m$-term
approximant from the already found $m$ elements of the dictionary.

We point out that at first glance the settings of approximation and
optimization problems are very different. In the approximation problem an
element $f \in X$ is given and our task is to find a sparse approximation
of it. In optimization theory an energy function $E(x)$ is given and we
should find an approximate sparse solution to the minimization problem. It
turns out that the same technique can be used for solving both problems.

We show how the technique developed in nonlinear approximation theory,
in particular the greedy approximation technique, can be adjusted for
finding a solution of problem (1.1) that is sparse with respect to
$\mathcal{D}$.

We begin with a brief description of greedy approximation methods in
Banach spaces. The reader can find a detailed discussion of greedy
approximation in the book [17]. Let $X$ be a Banach space with norm
$\|\cdot\|$. We say that a set of elements (functions) $\mathcal{D}$ from
$X$ is a dictionary, respectively a symmetric dictionary, if each
$g \in \mathcal{D}$ has norm bounded by one ($\|g\| \le 1$),

\[ g \in \mathcal{D} \quad\text{implies}\quad -g \in \mathcal{D}, \]

and the closure of $\operatorname{span}\mathcal{D}$ is $X$. In this paper
symmetric dictionaries are considered. We denote the closure (in $X$) of
the convex hull of $\mathcal{D}$ by $A_1(\mathcal{D})$. For a nonzero
element $f \in X$ we let $F_f$ denote a norming (peak) functional for $f$:

\[ \|F_f\| = 1, \qquad F_f(f) = \|f\|. \]

The existence of such a functional is guaranteed by the Hahn-Banach
theorem. We describe a typical greedy algorithm from the family of dual
greedy algorithms. Let $\tau := \{t_k\}_{k=1}^\infty$ be a given weakness
sequence of nonnegative numbers $t_k \le 1$, $k = 1, 2, \dots$. We define
first the Weak Chebyshev Greedy Algorithm (WCGA) (see [14]), which is a
generalization to Banach spaces of the Weak Orthogonal Greedy Algorithm.

Weak Chebyshev Greedy Algorithm (WCGA). We define
$f_0^c := f_0^{c,\tau} := f$. Then for each $m \ge 1$ we have the following
inductive definition.

(1) $\varphi_m^c := \varphi_m^{c,\tau} \in \mathcal{D}$ is any element
satisfying

\[ F_{f_{m-1}^c}(\varphi_m^c) \ge t_m \sup_{g \in \mathcal{D}} F_{f_{m-1}^c}(g). \]

(2) Define

\[ \Phi_m := \Phi_m^\tau := \operatorname{span}\{\varphi_j^c\}_{j=1}^m, \]

and define $G_m^c := G_m^{c,\tau}$ to be the best approximant to $f$ from
$\Phi_m$.

(3) Let

\[ f_m^c := f_m^{c,\tau} := f - G_m^c. \]

Let us make a remark that justifies the idea of the dual greedy
algorithms in terms of real analysis. We consider here approximation in
uniformly smooth Banach spaces. For a Banach space $X$ we define the
modulus of smoothness

\[ \rho(u) := \sup_{\|x\| = \|y\| = 1} \Big( \tfrac12\big(\|x + uy\| + \|x - uy\|\big) - 1 \Big). \]

A uniformly smooth Banach space is one with the property

\[ \lim_{u \to 0} \rho(u)/u = 0. \]

We note that from the definition of the modulus of smoothness we get the
following inequality:

\[ 0 \le \|x + uy\| - \|x\| - uF_x(y) \le 2\|x\|\,\rho(u\|y\|/\|x\|). \tag{1.2} \]

This inequality implies the following proposition.

Proposition 1.1. Let $X$ be a uniformly smooth Banach space. Then, for
any $x \ne 0$ and $y$ we have

\[ F_x(y) = \Big(\frac{d}{du}\|x + uy\|\Big)(0) = \lim_{u \to 0} \big(\|x + uy\| - \|x\|\big)/u. \tag{1.3} \]

Proposition 1.1 shows that in the WCGA we are looking for an element
$\varphi_m \in \mathcal{D}$ that provides a big derivative of the quantity
$\|f_{m-1} + ug\|$. Here is one more important greedy algorithm.

Weak Greedy Algorithm with Free Relaxation (WGAFR). Let
$\tau := \{t_m\}_{m=1}^\infty$, $t_m \in [0, 1]$, be a weakness sequence.
We define $f_0 := f$ and $G_0 := 0$. Then for each $m \ge 1$ we have the
following inductive definition.

(1) $\varphi_m \in \mathcal{D}$ is any element satisfying

\[ F_{f_{m-1}}(\varphi_m) \ge t_m \sup_{g \in \mathcal{D}} F_{f_{m-1}}(g). \]

(2) Find $w_m$ and $\lambda_m$ such that

\[ \|f - ((1 - w_m)G_{m-1} + \lambda_m\varphi_m)\| = \inf_{\lambda, w} \|f - ((1 - w)G_{m-1} + \lambda\varphi_m)\| \]

and define

\[ G_m := (1 - w_m)G_{m-1} + \lambda_m\varphi_m. \]

(3) Let

\[ f_m := f - G_m. \]

It is known that both algorithms WCGA and WGAFR converge in any uniformly
smooth Banach space under mild conditions on the weakness sequence
$\{t_k\}$; for instance, $t_k = t$, $k = 1, 2, \dots$, $t > 0$, guarantees
such convergence. The following theorem provides the rate of convergence
(see [17], pp. 347, 353).

Theorem 1.1. Let $X$ be a uniformly smooth Banach space with modulus of
smoothness $\rho(u) \le \gamma u^q$, $1 < q \le 2$. Take a number
$\epsilon \ge 0$ and two elements $f$, $f^\epsilon$ from $X$ such that

\[ \|f - f^\epsilon\| \le \epsilon, \qquad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some number $A(\epsilon) > 0$. Then, for both algorithms WCGA and
WGAFR we have ($p := q/(q-1)$)

\[ \|f_m\| \le \max\Big( 2\epsilon,\; C(q,\gamma)\,(A(\epsilon) + \epsilon)\Big(1 + \sum_{k=1}^m t_k^p\Big)^{-1/p} \Big). \]

The above Theorem 1.1 simultaneously takes care of two issues: noisy
data and approximation in an interpolation space. In order to apply it to
noisy data we interpret $f$ as a noisy version of a signal and $f^\epsilon$
as a noiseless version of the signal. Then the assumption
$f^\epsilon/A(\epsilon) \in A_1(\mathcal{D})$ describes our smoothness
assumption on the noiseless signal. Theorem 1.1 can also be applied for
approximation of $f$ under the assumption that $f$ belongs to one of the
interpolation spaces between $X$ and the space generated by the
$A_1(\mathcal{D})$-norm (atomic norm). We now make a remark showing that
the $A_1(\mathcal{D})$-norm (in other words, the assumption
$f/A \in A_1(\mathcal{D})$) appears naturally in convex optimization
problems.

It is pointed out in [7] that there has been considerable interest in
solving the convex unconstrained optimization problem

\[ \min_x \tfrac12\|y - \Phi x\|_2^2 + \lambda\|x\|_1, \tag{1.4} \]

where $x \in \mathbb{R}^n$, $y \in \mathbb{R}^k$, $\Phi$ is a $k \times n$
matrix, $\lambda$ is a nonnegative parameter, $\|v\|_2$ denotes the
Euclidean norm of $v$, and $\|v\|_1$ is the $\ell_1$ norm of $v$. Problems
of the form (1.4) have become familiar over the past three decades,
particularly in statistical and signal processing contexts. Problem (1.4)
is closely related to the following convex constrained optimization problem

\[ \min_x \tfrac12\|y - \Phi x\|_2^2 \quad\text{subject to}\quad \|x\|_1 \le A. \tag{1.5} \]

The above convex optimization problem can be recast as an approximation
problem of $y$ with respect to a dictionary
$\mathcal{D} := \{\pm\varphi_j\}_{j=1}^n$ which is associated with a
$k \times n$ matrix $\Phi = [\varphi_1 \dots \varphi_n]$, with
$\varphi_j \in \mathbb{R}^k$ being the column vectors of $\Phi$. The
condition $y \in A_1(\mathcal{D})$ is equivalent to the existence of
$x \in \mathbb{R}^n$ such that $y = \Phi x$ and

\[ \|x\|_1 := |x_1| + \dots + |x_n| \le 1. \tag{1.6} \]

As a direct corollary of Theorem 1.1 we get for any
$y \in A_1(\mathcal{D})$ that the WCGA and the WGAFR with $\tau = \{t\}$
guarantee the following upper bound for the error after $k$ iterations:

\[ \|y_k\|_2 \le Ck^{-1/2}. \tag{1.7} \]

The bound (1.7) holds for any $\mathcal{D}$ (any $\Phi$).
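
The equivalence between $y \in A_1(\mathcal{D})$ and (1.6) can be checked on a small instance. The sketch below is illustrative only: the helper names (`symmetric_dictionary`, `hull_representation`, `combine`) and the matrix are assumptions of this example, not part of the paper. It turns a coefficient vector $x$ with $\|x\|_1 \le 1$ into weights on the atoms $\pm\varphi_j$ and verifies that the weighted sum reproduces $y = \Phi x$.

```python
def symmetric_dictionary(Phi):
    """Atoms {+phi_j, -phi_j} built from the columns of the matrix Phi (list of rows)."""
    cols = [list(col) for col in zip(*Phi)]
    return cols + [[-v for v in col] for col in cols]

def hull_representation(Phi, x):
    """Given x with ||x||_1 <= 1, return (weights, atoms) such that
    Phi x = sum_i weights[i] * atoms[i], weights[i] > 0, sum(weights) <= 1."""
    if sum(abs(v) for v in x) > 1.0 + 1e-12:
        raise ValueError("||x||_1 must not exceed 1")
    cols = [list(col) for col in zip(*Phi)]
    weights, atoms = [], []
    for xj, col in zip(x, cols):
        if xj != 0.0:
            sign = 1.0 if xj > 0 else -1.0
            weights.append(abs(xj))              # weight |x_j|
            atoms.append([sign * v for v in col])  # atom sign(x_j) * phi_j
    return weights, atoms

def combine(weights, atoms, dim):
    """Evaluate sum_i weights[i] * atoms[i] in R^dim."""
    y = [0.0] * dim
    for w, a in zip(weights, atoms):
        y = [yi + w * ai for yi, ai in zip(y, a)]
    return y

# Hypothetical 2 x 3 matrix Phi and coefficients x with ||x||_1 = 1.
Phi = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
x = [0.5, -0.25, 0.25]
dictionary = symmetric_dictionary(Phi)
weights, atoms = hull_representation(Phi, x)
y_from_atoms = combine(weights, atoms, 2)
y_direct = [sum(p * xj for p, xj in zip(row, x)) for row in Phi]
```

Because the dictionary is symmetric, the origin lies in its convex hull, so weights summing to less than one are also admissible.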

We note that in the study of greedy-type algorithms in approximation
theory (see [17]) emphasis is put on the theory of approximation with
respect to an arbitrary dictionary $\mathcal{D}$. The reader can find
examples of specific dictionaries of interest in [17] and [20]. We present
some results on sparse solutions for convex optimization problems in the
setting with an arbitrary dictionary $\mathcal{D}$.

We generalize the algorithms WCGA and WGAFR to the case of convex
optimization and prove an analog of Theorem 1.1 for the new algorithms.
Let us illustrate this on the generalization of the WGAFR.

We assume that the set

\[ D := \{x : E(x) \le E(0)\} \]

is bounded. For a bounded set $D$ define the modulus of smoothness of $E$
on $D$ as follows:

\[ \rho(E,u) := \frac12 \sup_{x \in D,\ \|y\| = 1} |E(x + uy) + E(x - uy) - 2E(x)|. \tag{1.8} \]

We assume that $E$ is Fréchet differentiable. Then convexity of $E$ implies
that for any $x$, $y$

\[ E(y) \ge E(x) + \langle E'(x), y - x\rangle \tag{1.9} \]

or, in other words,

\[ E(x) - E(y) \le \langle E'(x), x - y\rangle = \langle -E'(x), y - x\rangle. \tag{1.10} \]

We will often use the following simple lemma.

Lemma 1.1. Let $E$ be a Fréchet differentiable convex function. Then the
following inequality holds for $x \in D$:

\[ 0 \le E(x + uy) - E(x) - u\langle E'(x), y\rangle \le 2\rho(E, u\|y\|). \tag{1.11} \]

Proof. The left inequality follows directly from (1.9). Next, from the
definition of the modulus of smoothness it follows that

\[ E(x + uy) + E(x - uy) \le 2\big(E(x) + \rho(E, u\|y\|)\big). \tag{1.12} \]

Inequality (1.9) gives

\[ E(x - uy) \ge E(x) + \langle E'(x), -uy\rangle = E(x) - u\langle E'(x), y\rangle. \tag{1.13} \]

Combining (1.12) and (1.13), we obtain

\[ E(x + uy) \le E(x) + u\langle E'(x), y\rangle + 2\rho(E, u\|y\|). \]

This proves the second inequality. □
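
As a quick sanity check of Lemma 1.1, consider a worked instance under an assumption made only for illustration: $X$ is a Hilbert space and $E(x) = \|x - f\|^2$ for a fixed $f$. Then $D$ is the ball $\{x : \|x - f\| \le \|f\|\}$, which is bounded, and the parallelogram identity gives the modulus of smoothness explicitly, so both sides of (1.11) can be compared directly.

```latex
% Illustrative assumption: X is a Hilbert space, E(x) = \|x - f\|^2.
\begin{align*}
E'(x) &= 2(x - f), \\
E(x+uy) + E(x-uy) - 2E(x) &= 2u^2\|y\|^2
  \quad\Longrightarrow\quad \rho(E,u) = u^2, \\
E(x+uy) - E(x) - u\langle E'(x), y\rangle &= u^2\|y\|^2, \\
2\rho(E, u\|y\|) &= 2u^2\|y\|^2, \\
0 \;\le\; u^2\|y\|^2 &\;\le\; 2u^2\|y\|^2 .
\end{align*}
```

In this model case $\rho(E,u) \le \gamma u^q$ holds with $q = 2$ and $\gamma = 1$.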

Weak Greedy Algorithm with Free Relaxation (WGAFR(co)).
Let $\tau := \{t_m\}_{m=1}^\infty$, $t_m \in [0, 1]$, be a weakness
sequence. We define $G_0 := 0$. Then for each $m \ge 1$ we have the
following inductive definition.

(1) $\varphi_m \in \mathcal{D}$ is any element satisfying

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle. \]

(2) Find $w_m$ and $\lambda_m$ such that

\[ E((1 - w_m)G_{m-1} + \lambda_m\varphi_m) = \inf_{\lambda, w} E((1 - w)G_{m-1} + \lambda\varphi_m) \]

and define

\[ G_m := (1 - w_m)G_{m-1} + \lambda_m\varphi_m. \]

In Section 4 we prove the following rate of convergence result.

Theorem 1.2. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u) \le \gamma u^q$, $1 < q \le 2$. Take a number
$\epsilon \ge 0$ and an element $f^\epsilon$ from $D$ such that

\[ E(f^\epsilon) \le \inf_{x \in D} E(x) + \epsilon, \qquad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some number $A(\epsilon) \ge 1$. Then we have for the WGAFR(co)
($p := q/(q-1)$)

\[ E(G_m) - \inf_{x \in D} E(x) \le \max\Big( 2\epsilon,\; C(q,\gamma)A(\epsilon)^q \Big( C(E,q,\gamma) + \sum_{k=1}^m t_k^p \Big)^{1-q} \Big). \]

We note that in all algorithms studied in this paper the sequence
$\{G_m\}_{m=0}^\infty$ of approximants satisfies the conditions

\[ G_0 = 0, \qquad E(G_0) \ge E(G_1) \ge E(G_2) \ge \dots. \]

This guarantees that $G_m \in D$ for all $m$.

This paper is the first author's paper on greedy-type methods in con- 
vex optimization. It is a slight modification of the paper [IS]. For the 
reader's convenience we now give a brief general description and
classification of greedy-type algorithms for convex optimization. The most
difficult part of an algorithm is to find an element
$\varphi_m \in \mathcal{D}$ to be used in the approximation process. We
consider greedy methods for finding $\varphi_m \in \mathcal{D}$. We have
two types of greedy steps to find $\varphi_m \in \mathcal{D}$.

I. Gradient greedy step. At this step we look for an element
$\varphi_m \in \mathcal{D}$ such that

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle. \]

II. $E$-greedy step. At this step we look for an element
$\varphi_m \in \mathcal{D}$ which satisfies (we assume existence):

\[ \inf_{c \in \mathbb{R}} E(G_{m-1} + c\varphi_m) = \inf_{g \in \mathcal{D},\ c \in \mathbb{R}} E(G_{m-1} + cg). \]

The above WGAFR(co) uses the greedy step of type I. In this paper we
only discuss algorithms based on the greedy step of type I. These
algorithms fall into the category of first order methods. The greedy step
of type II
uses only the function values E(x). We discussed some of the algorithms of 
this type in [TH] and plan to study them in our future work. 
After we have found $\varphi_m \in \mathcal{D}$ we can proceed in different
ways. We now list some typical steps that are motivated by the
corresponding steps in greedy approximation theory (see [17]). These steps
or their variants are used in optimization algorithms like gradient method,
reduced gradient method,
conjugate gradients, gradient pursuits (see, for instance, [8], [10], [9], [TTJ, [TJ 
and 0). 

(A) Best step in the direction $\varphi_m \in \mathcal{D}$. We choose
$c_m$ such that

\[ E(G_{m-1} + c_m\varphi_m) = \inf_{c \in \mathbb{R}} E(G_{m-1} + c\varphi_m) \]

and define

\[ G_m := G_{m-1} + c_m\varphi_m. \]

(B) Reduced best step in the direction $\varphi_m \in \mathcal{D}$. We
choose $c_m$ as in (A) and for a given parameter $b > 0$ define

\[ G_m := G_{m-1} + bc_m\varphi_m. \]

Usually, $b \in (0, 1)$. This is why we call it reduced.

(C) Chebyshev-type methods. We choose
$G_m \in \operatorname{span}(\varphi_1, \dots, \varphi_m)$ which satisfies

\[ E(G_m) = \inf_{c_j,\ j = 1, \dots, m} E(c_1\varphi_1 + \dots + c_m\varphi_m). \]

(D) Fixed relaxation. For a given sequence $\{r_k\}_{k=1}^\infty$ of
relaxation parameters $r_k \in [0, 1)$ we choose
$G_m := (1 - r_m)G_{m-1} + c_m\varphi_m$ with $c_m$ from

\[ E((1 - r_m)G_{m-1} + c_m\varphi_m) = \inf_{c \in \mathbb{R}} E((1 - r_m)G_{m-1} + c\varphi_m). \]

(F) Free relaxation. We choose
$G_m \in \operatorname{span}(G_{m-1}, \varphi_m)$ which satisfies

\[ E(G_m) = \inf_{c_1, c_2} E(c_1G_{m-1} + c_2\varphi_m). \]

(G) Prescribed coefficients. For a given sequence $\{c_k\}_{k=1}^\infty$
of positive coefficients, in the case of greedy step I we define

\[ G_m := G_{m-1} + c_m\varphi_m. \tag{1.14} \]

In the case of greedy step II we define $G_m$ by formula (1.14) with the
greedy step II modified as follows: $\varphi_m \in \mathcal{D}$ is an
element satisfying

\[ E(G_{m-1} + c_m\varphi_m) = \inf_{g \in \mathcal{D}} E(G_{m-1} + c_mg). \]
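
Steps (A) and (B) reduce to a one-dimensional convex minimization along the chosen direction. The sketch below shows one possible way to carry it out numerically with a golden-section search. It is an illustration under stated assumptions, not the paper's method: vectors are plain Python lists, the search is restricted to a hypothetical bracket `[-bound, bound]` (the paper takes the infimum over all real `c`), and the function names are invented for this example.

```python
import math

def golden_section_min(phi, lo, hi, tol=1e-10):
    """Approximate minimizer of a convex (unimodal) function phi on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0          # inverse golden ratio, about 0.618
    a, b = lo, hi
    c = b - r * (b - a)
    d = a + r * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:                           # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = phi(c)
        else:                                 # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

def best_step(E, G, direction, bound=10.0):
    """Step (A): coefficient c minimizing E(G + c * direction) on [-bound, bound]."""
    def phi(c):
        return E([g + c * v for g, v in zip(G, direction)])
    return golden_section_min(phi, -bound, bound)

def reduced_best_step(E, G, direction, b=0.5, bound=10.0):
    """Step (B): the best-step coefficient scaled by the reduction parameter b."""
    return b * best_step(E, G, direction, bound)

# Hypothetical energy on R^2 with minimizer (1, 1); search along the first axis.
E = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2
c_best = best_step(E, [0.0, 0.0], [1.0, 0.0])
c_reduced = reduced_best_step(E, [0.0, 0.0], [1.0, 0.0], b=0.5)
```

Golden-section search uses only function values, so it applies even when a closed-form minimizer along the direction is not available.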
We prove convergence and rate of convergence results here. Our setting in
an infinite dimensional Banach space makes the convergence results
nontrivial. The rate of convergence results are of interest in both finite
dimensional and infinite dimensional settings. In these results we make
assumptions on the element minimizing $E(x)$ (in other words, we look for
$\inf_{x \in S} E(x)$ for a special domain $S$). A typical assumption in
this regard is formulated in terms of the convex hull $A_1(\mathcal{D})$ of
the dictionary $\mathcal{D}$.

We have already mentioned above (see (1.5) and below) an example which
is of interest in applications in compressed sensing. We now mention
another example that has attracted a lot of attention in the recent
literature. In this example $X$ is a Hilbert space of all real matrices of
size $n \times n$ equipped with the Frobenius norm $\|\cdot\|_F$. A
dictionary $\mathcal{D}$ is the set of all matrices of rank one normalized
in the Frobenius norm. In this case $A_1(\mathcal{D})$ is the set of
matrices with nuclear norm not exceeding 1. We are interested in sparse
minimization of $E(x) := \|f - x\|_F^2$ (sparse approximation of $f$) with
respect to $\mathcal{D}$.
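
For the matrix example, the gradient greedy step has a concrete form: with $E(x) = \|f - x\|_F^2$ we have $-E'(G) = 2(f - G)$, and the Frobenius inner product of the residual $R = f - G$ with a normalized rank-one matrix $uv^T$ equals $u^TRv$, which is maximized by the leading singular pair of $R$. The sketch below approximates that pair by power iteration. It is a hedged illustration: the function names are invented here, and it assumes the starting vector is not orthogonal to the leading right singular vector.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def top_rank_one(R, iters=200):
    """Power iteration for the leading singular triple (sigma, u, v) of R.
    The rank-one atom u v^T maximizes <R, g> over normalized rank-one g."""
    n = len(R[0])
    v = [1.0 / n ** 0.5] * n                 # assumed generic starting vector
    Rt = transpose(R)
    for _ in range(iters):
        u = matvec(R, v)
        nu = norm(u)
        if nu == 0.0:                        # zero residual: nothing to select
            return 0.0, [0.0] * len(R), v
        u = [x / nu for x in u]
        v = matvec(Rt, u)
        nv = norm(v)
        v = [x / nv for x in v]
    sigma = sum(ui * x for ui, x in zip(u, matvec(R, v)))
    return sigma, u, v

def frobenius_energy(f, G):
    """E(G) = ||f - G||_F^2."""
    return sum((a - b) ** 2 for rf, rg in zip(f, G) for a, b in zip(rf, rg))

# Hypothetical target matrix; one greedy step followed by the best step (A).
f = [[3.0, 0.0],
     [0.0, 1.0]]
G0 = [[0.0, 0.0], [0.0, 0.0]]
sigma, u, v = top_rank_one(f)                # residual f - G0 equals f
G1 = [[sigma * ui * vj for vj in v] for ui in u]
energy_after = frobenius_energy(f, G1)
```

For this quadratic energy the best step along $uv^T$ has coefficient equal to the leading singular value, so one iteration yields the best rank-one approximation of the residual.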

2 The Weak Chebyshev Greedy Algorithm 

We begin with the following two simple and well-known lemmas. 

Lemma 2.1. Let $E$ be a uniformly smooth convex function on a Banach
space $X$ and let $L$ be a finite-dimensional subspace of $X$. Let $x_L$
denote the point from $L$ at which $E$ attains the minimum:

\[ E(x_L) = \inf_{x \in L} E(x). \]

Then we have

\[ \langle E'(x_L), \phi\rangle = 0 \]

for any $\phi \in L$.

Proof. Let us assume the contrary: there is a $\phi \in L$ such that
$\|\phi\| = 1$ and

\[ \langle E'(x_L), \phi\rangle = \beta > 0. \]

It is clear that $x_L \in L \cap D$. For any $\lambda$ we have from the
definition of $\rho(E, \lambda)$ that

\[ E(x_L - \lambda\phi) + E(x_L + \lambda\phi) \le 2\big(E(x_L) + \rho(E, \lambda)\big). \tag{2.1} \]

Next, by (1.9),

\[ E(x_L + \lambda\phi) \ge E(x_L) + \langle E'(x_L), \lambda\phi\rangle = E(x_L) + \lambda\beta. \tag{2.2} \]

Combining (2.1) and (2.2) we get

\[ E(x_L - \lambda\phi) \le E(x_L) - \lambda\beta + 2\rho(E, \lambda). \tag{2.3} \]

Taking into account that $\rho(E,u) = o(u)$, we find $\lambda' > 0$ such
that

\[ -\lambda'\beta + 2\rho(E, \lambda') < 0. \]

Then (2.3) gives

\[ E(x_L - \lambda'\phi) < E(x_L), \]

which contradicts the assumption that $x_L \in L$ is the point of minimum
of $E$. □

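Lemma 2.2. For any bounded linear functional $F$ and any dictionary
$\mathcal{D}$, we have

\[ \sup_{g \in \mathcal{D}} \langle F, g\rangle = \sup_{f \in A_1(\mathcal{D})} \langle F, f\rangle. \]

Proof. The inequality

\[ \sup_{g \in \mathcal{D}} \langle F, g\rangle \le \sup_{f \in A_1(\mathcal{D})} \langle F, f\rangle \]

is obvious. We prove the opposite inequality. Take any
$f \in A_1(\mathcal{D})$. Then for any $\epsilon > 0$ there exist
$g_1^\epsilon, \dots, g_N^\epsilon \in \mathcal{D}$ and numbers
$a_1^\epsilon, \dots, a_N^\epsilon$ such that $a_i^\epsilon > 0$,
$a_1^\epsilon + \dots + a_N^\epsilon \le 1$ and

\[ \Big\| f - \sum_{i=1}^N a_i^\epsilon g_i^\epsilon \Big\| \le \epsilon. \]

Thus

\[ \langle F, f\rangle \le \|F\|\epsilon + \Big\langle F, \sum_{i=1}^N a_i^\epsilon g_i^\epsilon \Big\rangle \le \epsilon\|F\| + \sup_{g \in \mathcal{D}} \langle F, g\rangle, \]

which proves Lemma 2.2. □

We define the following generalization of the WCGA for convex
optimization.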

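Weak Chebyshev Greedy Algorithm (WCGA(co)). We define
$G_0 := 0$. Then for each $m \ge 1$ we have the following inductive
definition.

(1) $\varphi_m := \varphi_m^{c,\tau} \in \mathcal{D}$ is any element
satisfying

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle. \]

(2) Define

\[ \Phi_m := \Phi_m^\tau := \operatorname{span}\{\varphi_j\}_{j=1}^m, \]

and define $G_m := G_m^{c,\tau}$ to be the point from $\Phi_m$ at which $E$
attains the minimum:

\[ E(G_m) = \inf_{x \in \Phi_m} E(x). \]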
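
The two steps of the WCGA(co) can be sketched for one concrete energy. The example below is a minimal illustration under assumptions not taken from the paper: the space is $\mathbb{R}^n$ with the Euclidean norm, $E(x) = \frac12\|y - x\|_2^2$ (so $-E'(G) = y - G$ and the minimizer over $\Phi_m$ is the orthogonal projection of $y$), the dictionary is finite, and the function name and tolerances are invented for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wcga_co_quadratic(y, dictionary, steps, t=1.0):
    """WCGA(co) sketch for E(x) = 0.5 * ||y - x||^2 with a finite symmetric dictionary.
    Returns the final approximant and the list of energies E(G_1), E(G_2), ..."""
    basis = []                                   # orthonormal basis of Phi_m
    G = [0.0] * len(y)
    energies = []
    for _ in range(steps):
        r = [yi - gi for yi, gi in zip(y, G)]    # r = -E'(G_{m-1})
        scores = [dot(r, g) for g in dictionary]
        best = max(scores)
        if best <= 1e-12:                        # gradient vanishes on the dictionary
            break
        # Step (1): any atom whose score is at least t * sup qualifies; take the first.
        phi = next(g for g, s in zip(dictionary, scores) if s >= t * best)
        # Step (2): extend the orthonormal basis (Gram-Schmidt) and project y.
        v = list(phi)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        nv = dot(v, v) ** 0.5
        if nv > 1e-12:
            basis.append([vi / nv for vi in v])
        G = [0.0] * len(y)
        for q in basis:
            c = dot(y, q)
            G = [gi + c * qi for gi, qi in zip(G, q)]
        r = [yi - gi for yi, gi in zip(y, G)]
        energies.append(0.5 * dot(r, r))
    return G, energies

# Hypothetical example: coordinate atoms and their negatives in R^3.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dictionary = atoms + [[-v for v in a] for a in atoms]
G, energies = wcga_co_quadratic([0.5, -0.3, 0.2], dictionary, steps=3)
```

For this particular energy the scheme coincides with the orthogonal greedy algorithm; for a general convex $E$ the projection in step (2) would be replaced by a numerical minimization over $\Phi_m$.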

The following lemma is a key lemma in studying convergence and rate of 
convergence of WCGA(co). 

Lemma 2.3. Let $E$ be a uniformly smooth convex function with modulus of
smoothness $\rho(E,u)$. Take a number $\epsilon \ge 0$ and an element
$f^\epsilon$ from $D$ such that

\[ E(f^\epsilon) \le \inf_{x \in D} E(x) + \epsilon, \qquad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some number $A(\epsilon) \ge 1$. Then we have for the WCGA(co)

\[ E(G_m) - E(f^\epsilon) \le E(G_{m-1}) - E(f^\epsilon) + \inf_{\lambda \ge 0} \big( -\lambda t_mA(\epsilon)^{-1}(E(G_{m-1}) - E(f^\epsilon)) + 2\rho(E, \lambda) \big), \]

for $m = 1, 2, \dots$.

Proof. It follows from the definition of the WCGA(co) that
$E(0) \ge E(G_1) \ge E(G_2) \ge \dots$. Therefore, if
$E(G_{m-1}) - E(f^\epsilon) \le 0$ then the claim of Lemma 2.3 is trivial.
Assume $E(G_{m-1}) - E(f^\epsilon) > 0$. By Lemma 1.1 we have for any
$\lambda \ge 0$

\[ E(G_{m-1} + \lambda\varphi_m) \le E(G_{m-1}) - \lambda\langle -E'(G_{m-1}), \varphi_m\rangle + 2\rho(E, \lambda) \tag{2.4} \]

and by (1) from the definition of the WCGA(co) and Lemma 2.2 we get

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle = t_m \sup_{\phi \in A_1(\mathcal{D})} \langle -E'(G_{m-1}), \phi\rangle \ge t_mA(\epsilon)^{-1}\langle -E'(G_{m-1}), f^\epsilon\rangle. \]

By Lemma 2.1 and (1.10) we obtain

\[ \langle -E'(G_{m-1}), f^\epsilon\rangle = \langle -E'(G_{m-1}), f^\epsilon - G_{m-1}\rangle \ge E(G_{m-1}) - E(f^\epsilon). \]

Thus,

\[ E(G_m) \le \inf_{\lambda \ge 0} E(G_{m-1} + \lambda\varphi_m) \le E(G_{m-1}) + \inf_{\lambda \ge 0} \big( -\lambda t_mA(\epsilon)^{-1}(E(G_{m-1}) - E(f^\epsilon)) + 2\rho(E, \lambda) \big), \tag{2.5} \]

which proves the lemma. □

We proceed to a theorem on convergence of the WCGA(co). In the
formulation of this theorem we need a special sequence which is defined for
a given modulus of smoothness $\rho(u)$ and a given
$\tau = \{t_k\}_{k=1}^\infty$.

Definition 2.1. Let $\rho(E,u)$ be an even convex function on
$(-\infty, \infty)$ with the property:

\[ \lim_{u \to 0} \rho(E, u)/u = 0. \]

For any $\tau = \{t_k\}_{k=1}^\infty$, $0 < t_k \le 1$, and $\theta > 0$ we
define $\xi_m := \xi_m(\rho, \tau, \theta)$ as a number $u$ satisfying the
equation

\[ \rho(E, u) = \theta t_mu. \tag{2.6} \]

Remark 2.1. Assumptions on $\rho(E,u)$ imply that the function

\[ s(u) := \rho(E,u)/u, \quad u \ne 0, \qquad s(0) = 0, \]

is a continuous increasing function on $[0, \infty)$. Thus (2.6) has a
unique solution $\xi_m = s^{-1}(\theta t_m)$ such that $\xi_m > 0$ for
$\theta \le \theta_0 := s(2)$. In this case we have
$\xi_m(\rho, \tau, \theta) \le 2$.
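
To make the sequence $\xi_m$ concrete, here is a worked instance in a model case assumed only for illustration, namely a modulus of exact power type $\rho(E,u) = \gamma u^q$ with $1 < q \le 2$. Equation (2.6) can then be solved in closed form, and the product $t_m\xi_m$ that enters the convergence condition becomes a power of $t_m$.

```latex
% Illustrative model case: \rho(E,u) = \gamma u^q, 1 < q \le 2, p := q/(q-1).
\begin{align*}
\gamma\,\xi_m^{\,q} = \theta t_m \xi_m
  &\quad\Longrightarrow\quad
  \xi_m = \Big(\frac{\theta t_m}{\gamma}\Big)^{1/(q-1)}, \\
t_m\,\xi_m &= \Big(\frac{\theta}{\gamma}\Big)^{1/(q-1)} t_m^{\,1 + 1/(q-1)}
  = \Big(\frac{\theta}{\gamma}\Big)^{1/(q-1)} t_m^{\,p}, \\
\theta_0 = s(2) = \gamma\,2^{q-1}
  &\quad\Longrightarrow\quad
  \xi_m \le 2 \ \text{ for } \theta \le \theta_0,\ t_m \le 1 .
\end{align*}
```

In particular, for $q = 2$ and $\gamma = 1$ one gets $\xi_m = \theta t_m$, so in this model case $\sum_m t_m\xi_m = \infty$ is the same as $\sum_m t_m^2 = \infty$.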

Theorem 2.1. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u)$. Assume that a sequence
$\tau := \{t_k\}_{k=1}^\infty$ satisfies the condition: for any
$\theta \in (0, \theta_0]$ we have

\[ \sum_{m=1}^\infty t_m\xi_m(\rho, \tau, \theta) = \infty. \]

Then

\[ \lim_{m \to \infty} E(G_m) = \inf_{x \in D} E(x). \]

Corollary 2.1. Let a convex function $E$ have modulus of smoothness
$\rho(E,u)$ of power type $1 < q \le 2$, that is,
$\rho(E,u) \le \gamma u^q$. Assume that

\[ \sum_{m=1}^\infty t_m^p = \infty, \qquad p = \frac{q}{q-1}. \tag{2.7} \]

Then

\[ \lim_{m \to \infty} E(G_m) = \inf_{x \in D} E(x). \]



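Proof. The definition of the WCGA(co) implies that $\{E(G_m)\}$ is a
non-increasing sequence. Therefore we have

\[ \lim_{m \to \infty} E(G_m) = a. \]

Denote

\[ b := \inf_{x \in D} E(x), \qquad \alpha := a - b. \]

We prove that $\alpha = 0$ by contradiction. Assume to the contrary that
$\alpha > 0$. Then, for any $m$ we have

\[ E(G_m) - b \ge \alpha. \]

We set $\epsilon = \alpha/2$ and find $f^\epsilon$ such that

\[ E(f^\epsilon) \le b + \epsilon \quad\text{and}\quad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some $A(\epsilon) \ge 1$. Then, by Lemma 2.3 we get

\[ E(G_m) - E(f^\epsilon) \le E(G_{m-1}) - E(f^\epsilon) + \inf_{\lambda \ge 0} \big( -\lambda t_mA(\epsilon)^{-1}\alpha/2 + 2\rho(E, \lambda) \big). \]

Let us specify $\theta := \min\big(\theta_0, \frac{\alpha}{8A(\epsilon)}\big)$
and take $\lambda = \xi_m(\rho, \tau, \theta)$. Then we obtain

\[ E(G_m) \le E(G_{m-1}) - 2\theta t_m\xi_m. \]

The assumption

\[ \sum_{m=1}^\infty t_m\xi_m = \infty \]

brings a contradiction, which proves the theorem. □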
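Theorem 2.2. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u) \le \gamma u^q$, $1 < q \le 2$. Take a number
$\epsilon \ge 0$ and an element $f^\epsilon$ from $D$ such that

\[ E(f^\epsilon) \le \inf_{x \in D} E(x) + \epsilon, \qquad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some number $A(\epsilon) \ge 1$. Then we have for the WCGA(co)
($p := q/(q-1)$)

\[ E(G_m) - \inf_{x \in D} E(x) \le \max\Big( 2\epsilon,\; C(q,\gamma)A(\epsilon)^q \Big( C(E,q,\gamma) + \sum_{k=1}^m t_k^p \Big)^{1-q} \Big). \tag{2.8} \]

Proof. Denote

\[ a_n := E(G_n) - E(f^\epsilon). \]

The sequence $\{a_n\}$ is non-increasing. If $a_n \le 0$ for some $n \le m$
then $E(G_m) - E(f^\epsilon) \le 0$ and
$E(G_m) - \inf_{x \in D} E(x) \le \epsilon$, which implies (2.8). Thus we
assume that $a_n > 0$ for $n \le m$. By Lemma 2.3 we have

\[ a_m \le a_{m-1} + \inf_{\lambda \ge 0} \Big( -\frac{\lambda t_ma_{m-1}}{A(\epsilon)} + 2\gamma\lambda^q \Big). \tag{2.9} \]

Choose $\lambda$ from the equation

\[ \frac{\lambda t_ma_{m-1}}{A(\epsilon)} = 4\gamma\lambda^q, \]

which implies that

\[ \lambda = \Big( \frac{t_ma_{m-1}}{4\gamma A(\epsilon)} \Big)^{\frac{1}{q-1}}. \]

Let

\[ A_q := 2(4\gamma)^{\frac{1}{q-1}}. \]

Using the notation $p := \frac{q}{q-1}$ we get from (2.9)

\[ a_m \le a_{m-1}\Big( 1 - \frac{\lambda t_m}{2A(\epsilon)} \Big) = a_{m-1}\Big( 1 - t_m^p a_{m-1}^{\frac{1}{q-1}} / (A_qA(\epsilon)^p) \Big). \]

Raising both sides of this inequality to the power $\frac{1}{q-1}$ and
taking into account the inequality $x^r \le x$ for $r \ge 1$,
$0 \le x \le 1$, we obtain

\[ a_m^{\frac{1}{q-1}} \le a_{m-1}^{\frac{1}{q-1}}\Big( 1 - t_m^p a_{m-1}^{\frac{1}{q-1}} / (A_qA(\epsilon)^p) \Big). \]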
We now need a simple known lemma (see [IB]). 

Lemma 2.4. Suppose that a sequence $y_1 \ge y_2 \ge \dots > 0$ satisfies
the inequalities

\[ y_k \le y_{k-1}(1 - w_ky_{k-1}), \qquad w_k \ge 0, \]

for $k > n$. Then for $m > n$ we have

\[ \frac{1}{y_m} \ge \frac{1}{y_n} + \sum_{k=n+1}^m w_k. \]

Proof. It follows from the chain of inequalities

\[ \frac{1}{y_k} \ge \frac{1}{y_{k-1}}(1 - w_ky_{k-1})^{-1} \ge \frac{1}{y_{k-1}}(1 + w_ky_{k-1}) = \frac{1}{y_{k-1}} + w_k. \]

□
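
Lemma 2.4 is easy to probe numerically. The sketch below is only an illustration with invented names: it generates a sequence that satisfies the hypothesis with equality, $y_k = y_{k-1}(1 - w_ky_{k-1})$, and compares $1/y_m$ with $1/y_0 + \sum_{k \le m} w_k$. It assumes $w_ky_{k-1} < 1$ throughout, so that the sequence stays positive.

```python
def lemma_2_4_gaps(y0, weights):
    """Build y_k = y_{k-1} * (1 - w_k * y_{k-1}) and return the pairs
    (1 / y_m, 1 / y_0 + w_1 + ... + w_m) for m = 1, 2, ..."""
    y = y0
    partial_sum = 0.0
    pairs = []
    for w in weights:
        if not 0.0 <= w * y < 1.0:
            raise ValueError("need 0 <= w_k * y_{k-1} < 1 to keep y_k positive")
        y = y * (1.0 - w * y)          # equality case of the hypothesis
        partial_sum += w
        pairs.append((1.0 / y, 1.0 / y0 + partial_sum))
    return pairs

# Hypothetical data: y_0 = 1 and constant weights w_k = 0.5.
pairs = lemma_2_4_gaps(1.0, [0.5] * 50)
```

With these data the first pair is $(2, 1.5)$: the left-hand side already exceeds the bound after one step, and the gap never closes because each step contributes at least $w_k$ to $1/y_k$.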



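By Lemma 2.4 with $y_k := a_k^{\frac{1}{q-1}}$, $n = 0$,
$w_k = t_k^p/(A_qA(\epsilon)^p)$ we get

\[ a_m^{\frac{1}{q-1}} \le C_1(q,\gamma)A(\epsilon)^p \Big( C(E,q,\gamma) + \sum_{n=1}^m t_n^p \Big)^{-1}, \]

which implies

\[ a_m \le C(q,\gamma)A(\epsilon)^q \Big( C(E,q,\gamma) + \sum_{n=1}^m t_n^p \Big)^{1-q}. \]

Theorem 2.2 is now proved. □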



3 Relaxation. Co-convex approximation

In this section we study a generalization for optimization problems of
the relaxed greedy algorithms in Banach spaces considered in [14]. Let
$\tau := \{t_k\}_{k=1}^\infty$ be a given weakness sequence of numbers
$t_k \in [0, 1]$, $k = 1, 2, \dots$.

Weak Relaxed Greedy Algorithm (WRGA(co)). We define
$G_0 := G_0^{r,\tau} := 0$. Then, for each $m \ge 1$ we have the following
inductive definition.

(1) $\varphi_m := \varphi_m^{r,\tau} \in \mathcal{D}$ is any element
satisfying

\[ \langle -E'(G_{m-1}), \varphi_m - G_{m-1}\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g - G_{m-1}\rangle. \]

(2) Find $0 \le \lambda_m \le 1$ such that

\[ E((1 - \lambda_m)G_{m-1} + \lambda_m\varphi_m) = \inf_{0 \le \lambda \le 1} E((1 - \lambda)G_{m-1} + \lambda\varphi_m) \]

and define

\[ G_m := G_m^{r,\tau} := (1 - \lambda_m)G_{m-1} + \lambda_m\varphi_m. \]

Remark 3.1. It follows from the definition of the WRGA(co) that the
sequence $\{E(G_m)\}$ is a non-increasing sequence.
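
A minimal sketch of the WRGA(co) follows, under assumptions made only for this illustration: the space is $\mathbb{R}^n$ with the Euclidean norm, $E(x) = \frac12\|y - x\|_2^2$, the dictionary is finite, and the function name is invented. For this quadratic energy the minimization over $\lambda \in [0, 1]$ in step (2) has a closed form, namely the unconstrained minimizer clipped to $[0, 1]$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wrga_co_quadratic(y, dictionary, steps, t=1.0):
    """WRGA(co) sketch for E(x) = 0.5 * ||y - x||^2.
    Returns the final approximant and energies [E(G_0), E(G_1), ...]."""
    G = [0.0] * len(y)
    energies = [0.5 * dot(y, y)]
    for _ in range(steps):
        r = [yi - gi for yi, gi in zip(y, G)]            # r = -E'(G_{m-1})
        shift = dot(r, G)
        scores = [dot(r, g) - shift for g in dictionary]  # <r, g - G_{m-1}>
        best = max(scores)
        if best <= 1e-14:
            break
        # Step (1): first atom whose score reaches t * sup.
        phi = next(g for g, s in zip(dictionary, scores) if s >= t * best)
        d = [pi - gi for pi, gi in zip(phi, G)]           # direction phi - G_{m-1}
        dd = dot(d, d)
        if dd == 0.0:
            break
        # Step (2): exact line search on [0, 1] for the quadratic energy.
        lam = min(1.0, max(0.0, dot(r, d) / dd))
        G = [gi + lam * di for gi, di in zip(G, d)]
        r = [yi - gi for yi, gi in zip(y, G)]
        energies.append(0.5 * dot(r, r))
    return G, energies

# Hypothetical example: y in A_1(D) for D = {+e_1, +e_2, -e_1, -e_2}.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
G, energies = wrga_co_quadratic([0.25, 0.25], dictionary, steps=50)
```

Each iterate is a convex combination of the previous iterate and an atom, so $G_m$ stays in the convex hull of the dictionary; this is the structural property that distinguishes the relaxed algorithm from the Chebyshev-type one.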

We call the WRGA(co) relaxed because at the $m$th step of the algorithm
we use a linear combination (convex combination) of the previous
approximant $G_{m-1}$ and a new element $\varphi_m$. The relaxation
parameter $\lambda_m$ in the WRGA(co) is chosen at the $m$th step depending
on $E$. We prove here the analogs of Theorems 2.1 and 2.2 for the Weak
Relaxed Greedy Algorithm.

Theorem 3.1. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u)$. Assume that a sequence
$\tau := \{t_k\}_{k=1}^\infty$ satisfies the condition: for any
$\theta \in (0, \theta_0]$ we have

\[ \sum_{m=1}^\infty t_m\xi_m(\rho, \tau, \theta) = \infty. \]

Then, for the WRGA(co) we have

\[ \lim_{m \to \infty} E(G_m) = \inf_{x \in A_1(\mathcal{D})} E(x). \]

Theorem 3.2. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u) \le \gamma u^q$, $1 < q \le 2$. Then, for a
sequence $\tau := \{t_k\}_{k=1}^\infty$, $t_k \le 1$, $k = 1, 2, \dots$, we
have for any $f \in A_1(\mathcal{D})$ that

\[ E(G_m) - E(f) \le \Big( 1 + C_1(q,\gamma)\sum_{k=1}^m t_k^p \Big)^{1-q}, \qquad p := \frac{q}{q-1}, \]

with a positive constant $C_1(q,\gamma)$ which may depend only on $q$ and
$\gamma$.
Proof. This proof is similar to the proof of Theorems 2.1 and 2.2. Instead
of Lemma 2.3 we use the following lemma.

Lemma 3.1. Let $E$ be a uniformly smooth convex function with modulus of
smoothness $\rho(E,u)$. Then, for any $f \in A_1(\mathcal{D})$ we have

\[ E(G_m) \le E(G_{m-1}) + \inf_{0 \le \lambda \le 1} \big( -\lambda t_m(E(G_{m-1}) - E(f)) + 2\rho(E, 2\lambda) \big), \qquad m = 1, 2, \dots. \]

Proof. We have

\[ G_m := (1 - \lambda_m)G_{m-1} + \lambda_m\varphi_m = G_{m-1} + \lambda_m(\varphi_m - G_{m-1}) \]

and

\[ E(G_m) = \inf_{0 \le \lambda \le 1} E(G_{m-1} + \lambda(\varphi_m - G_{m-1})). \]

As for (2.4) we have for any $\lambda$

\[ E(G_{m-1} + \lambda(\varphi_m - G_{m-1})) \le E(G_{m-1}) - \lambda\langle -E'(G_{m-1}), \varphi_m - G_{m-1}\rangle + 2\rho(E, 2\lambda) \tag{3.1} \]

and by (1) from the definition of the WRGA(co) and Lemma 2.2 we get

\[ \langle -E'(G_{m-1}), \varphi_m - G_{m-1}\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g - G_{m-1}\rangle = t_m \sup_{\phi \in A_1(\mathcal{D})} \langle -E'(G_{m-1}), \phi - G_{m-1}\rangle \ge t_m\langle -E'(G_{m-1}), f - G_{m-1}\rangle. \]

By (1.10) we obtain

\[ \langle -E'(G_{m-1}), f - G_{m-1}\rangle \ge E(G_{m-1}) - E(f). \]

Thus,

\[ E(G_m) \le \inf_{0 \le \lambda \le 1} E(G_{m-1} + \lambda(\varphi_m - G_{m-1})) \le E(G_{m-1}) + \inf_{0 \le \lambda \le 1} \big( -\lambda t_m(E(G_{m-1}) - E(f)) + 2\rho(E, 2\lambda) \big), \tag{3.2} \]

which proves the lemma. □

The remaining part of the proof uses the inequality (3.2) in the same
way relation (2.5) was used in the proof of Theorems 2.1 and 2.2. The only
additional difficulty here is that we are optimizing over
$0 \le \lambda \le 1$. In the proof of Theorem 3.1 we choose
$\theta = \alpha/8$, assuming that $\alpha$ is small enough to guarantee
that $\theta \le \theta_0$, and $\lambda = \xi_m(\rho, \tau, \theta)/2$.

We proceed to the proof of Theorem 3.2. Denote

\[ a_n := E(G_n) - E(f). \]

The sequence $\{a_n\}$ is non-increasing. If $a_n \le 0$ for some $n \le m$
then $E(G_m) - E(f) \le 0$, which implies Theorem 3.2. Thus we assume that
$a_n > 0$ for $n \le m$. We obtain from Lemma 3.1

\[ a_m \le a_{m-1} + \inf_{0 \le \lambda \le 1} \big( -\lambda t_ma_{m-1} + 2\gamma(2\lambda)^q \big). \]

We choose $\lambda$ from the equation

\[ \lambda t_ma_{m-1} = 4\gamma(2\lambda)^q \tag{3.3} \]

if it is not greater than 1 and choose $\lambda = 1$ otherwise. The
sequence $\{a_k\}$ is monotone decreasing and therefore we may choose
$\lambda = 1$ only at the first $n$ steps and then choose $\lambda$ from
(3.3). Then we get for $k \le n$

\[ a_k \le a_{k-1}(1 - t_k/2) \]

and

\[ a_n \le a_0\prod_{k=1}^n (1 - t_k/2). \tag{3.4} \]

For $k > n$ we have

\[ a_k \le a_{k-1}(1 - \lambda t_k/2), \qquad \lambda = \Big( \frac{t_ka_{k-1}}{2^{2+q}\gamma} \Big)^{\frac{1}{q-1}}. \tag{3.5} \]

As in the proof of Theorem 2.2 we obtain using Lemma 2.4

\[ \frac{1}{y_m} \ge \frac{1}{y_n} + \sum_{k=n+1}^m w_k, \qquad y_k := a_k^{\frac{1}{q-1}}, \quad w_k := \frac{t_k^p}{2(2^{2+q}\gamma)^{\frac{1}{q-1}}}. \]

By (3.4) we get

\[ \frac{1}{y_n} \ge \frac{1}{y_0}\prod_{k=1}^n (1 - t_k/2)^{-\frac{1}{q-1}}. \]

Next,

\[ \prod_{k=1}^n (1 - t_k/2)^{-\frac{1}{q-1}} \ge \prod_{k=1}^n (1 + t_k/2)^{\frac{1}{q-1}} \ge \prod_{k=1}^n (1 + t_k/2) \ge 1 + \frac12\sum_{k=1}^n t_k \ge 1 + \frac12\sum_{k=1}^n t_k^p. \]

Combining the above inequalities we complete the proof. □

4 Free relaxation 

Both of the above algorithms, the WCGA(co) and the WRGA(co), use the
functional $E'(G_{m-1})$ in a search for the $m$th element $\varphi_m$ from
the dictionary to be used in optimization. The construction of the
approximant in the WRGA(co) is different from the construction in the
WCGA(co). In the WCGA(co) we build the approximant $G_m$ so as to maximally
use the minimization power of the elements
$\varphi_1, \dots, \varphi_m$. The WRGA(co) by its definition is designed
for working with functions from $A_1(\mathcal{D})$. In building the
approximant in the WRGA(co) we keep the property
$G_m \in A_1(\mathcal{D})$. As we mentioned in Section 3, the relaxation
parameter $\lambda_m$ in the WRGA(co) is chosen at the $m$th step depending
on $E$. The following modification of the above idea of relaxation in
greedy approximation will be studied in this section (see [15]).

Weak Greedy Algorithm with Free Relaxation (WGAFR(co)).
Let $\tau := \{t_m\}_{m=1}^\infty$, $t_m \in [0, 1]$, be a weakness
sequence. We define $G_0 := 0$. Then for each $m \ge 1$ we have the
following inductive definition.

(1) $\varphi_m \in \mathcal{D}$ is any element satisfying

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle. \]

(2) Find $w_m$ and $\lambda_m$ such that

\[ E((1 - w_m)G_{m-1} + \lambda_m\varphi_m) = \inf_{\lambda, w} E((1 - w)G_{m-1} + \lambda\varphi_m) \]

and define

\[ G_m := (1 - w_m)G_{m-1} + \lambda_m\varphi_m. \]

Remark 4.1. It follows from the definition of the WGAFR(co) that the
sequence $\{E(G_m)\}$ is a non-increasing sequence.
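
The free relaxation step can be sketched for one concrete energy. The example below is illustrative only and rests on assumptions not made in the paper: the space is $\mathbb{R}^n$ with the Euclidean norm, $E(x) = \frac12\|y - x\|_2^2$, the dictionary is finite, and the function name and tolerances are invented. For this energy, minimizing over $(w, \lambda)$ amounts to projecting $y$ onto the span of $G_{m-1}$ and $\varphi_m$, which is done here by solving the $2 \times 2$ normal equations.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wgafr_co_quadratic(y, dictionary, steps, t=1.0):
    """WGAFR(co) sketch for E(x) = 0.5 * ||y - x||^2.
    Returns the final approximant and the energies E(G_1), E(G_2), ..."""
    G = [0.0] * len(y)
    energies = []
    for _ in range(steps):
        r = [yi - gi for yi, gi in zip(y, G)]        # r = -E'(G_{m-1})
        scores = [dot(r, g) for g in dictionary]
        best = max(scores)
        if best <= 1e-12:
            break
        # Step (1): first atom whose score reaches t * sup.
        phi = next(g for g, s in zip(dictionary, scores) if s >= t * best)
        # Step (2): minimize E(a * G + b * phi) over (a, b), where a = 1 - w, b = lambda.
        gg, gp, pp = dot(G, G), dot(G, phi), dot(phi, phi)
        yg, yp = dot(y, G), dot(y, phi)
        det = gg * pp - gp * gp
        if det <= 1e-12 * max(gg * pp, 1e-300):      # G = 0 or G parallel to phi
            a, b = 0.0, yp / pp
        else:
            a = (yg * pp - yp * gp) / det
            b = (gg * yp - gp * yg) / det
        G = [a * gi + b * pi for gi, pi in zip(G, phi)]
        r = [yi - gi for yi, gi in zip(y, G)]
        energies.append(0.5 * dot(r, r))
    return G, energies

# Hypothetical non-orthogonal dictionary in R^2 together with its negatives.
atoms = [[1.0, 0.0], [0.6, 0.8]]
dictionary = atoms + [[-v for v in a] for a in atoms]
G, energies = wgafr_co_quadratic([0.2, 0.5], dictionary, steps=5)
```

Unlike the convex-combination update of the relaxed algorithm, the coefficient of $G_{m-1}$ is unconstrained here; in the example the second step rescales $G_1$ by a factor larger than one.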
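We begin with an analog of Lemma 2.3.

Lemma 4.1. Let $E$ be a uniformly smooth convex function with modulus of
smoothness $\rho(E,u)$. Take a number $\epsilon \ge 0$ and an element
$f^\epsilon$ from $D$ such that

\[ E(f^\epsilon) \le \inf_{x \in D} E(x) + \epsilon, \qquad f^\epsilon/A(\epsilon) \in A_1(\mathcal{D}), \]

with some number $A(\epsilon) \ge 1$. Then we have for the WGAFR(co)

\[ E(G_m) - E(f^\epsilon) \le E(G_{m-1}) - E(f^\epsilon) + \inf_{\lambda \ge 0} \big( -\lambda t_mA(\epsilon)^{-1}(E(G_{m-1}) - E(f^\epsilon)) + 2\rho(E, C_0\lambda) \big), \]

for $m = 1, 2, \dots$.

Proof. By the definition of $G_m$

\[ E(G_m) \le \inf_{\lambda \ge 0,\ w} E(G_{m-1} - wG_{m-1} + \lambda\varphi_m). \]

As in the arguments in the proof of Lemma 2.3 we use Lemma 1.1

\[ E(G_{m-1} + \lambda\varphi_m - wG_{m-1}) \le E(G_{m-1}) - \lambda\langle -E'(G_{m-1}), \varphi_m\rangle - w\langle E'(G_{m-1}), G_{m-1}\rangle + 2\rho(E, \|\lambda\varphi_m - wG_{m-1}\|) \tag{4.1} \]

and estimate

\[ \langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g \in \mathcal{D}} \langle -E'(G_{m-1}), g\rangle = t_m \sup_{\phi \in A_1(\mathcal{D})} \langle -E'(G_{m-1}), \phi\rangle \ge t_mA(\epsilon)^{-1}\langle -E'(G_{m-1}), f^\epsilon\rangle. \]

We set $w^* := \lambda t_mA(\epsilon)^{-1}$ and obtain

\[ E(G_{m-1} - w^*G_{m-1} + \lambda\varphi_m) \le E(G_{m-1}) - \lambda t_mA(\epsilon)^{-1}\langle -E'(G_{m-1}), f^\epsilon - G_{m-1}\rangle + 2\rho(E, \|\lambda\varphi_m - w^*G_{m-1}\|). \tag{4.2} \]

By (1.10) we obtain

\[ \langle -E'(G_{m-1}), f^\epsilon - G_{m-1}\rangle \ge E(G_{m-1}) - E(f^\epsilon). \]

Thus,

\[ E(G_m) \le E(G_{m-1}) + \inf_{\lambda \ge 0} \big( -\lambda t_mA(\epsilon)^{-1}(E(G_{m-1}) - E(f^\epsilon)) + 2\rho(E, \|\lambda\varphi_m - w^*G_{m-1}\|) \big). \tag{4.3} \]

We now estimate

\[ \|w^*G_{m-1} - \lambda\varphi_m\| \le w^*\|G_{m-1}\| + \lambda. \]

Next, $E(G_{m-1}) \le E(0)$ and, therefore, $G_{m-1} \in D$. Our assumption
on boundedness of $D$ implies that $\|G_{m-1}\| \le C_1$. Thus, under the
assumption $A(\epsilon) \ge 1$ we get

\[ w^*\|G_{m-1}\| \le C_1\lambda t_m \le C_1\lambda. \]

Finally,

\[ \|w^*G_{m-1} - \lambda\varphi_m\| \le C_0\lambda. \]

This completes the proof of Lemma 4.1. □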

We now prove a convergence theorem for an arbitrary uniformly smooth 
convex function. Modulus of smoothness p(E, u) of a uniformly smooth con- 
vex function is an even convex function such that p(E, 0) = and 

lim p(E, u)/u = 0. 

u— >0 

Theorem 4.1. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u)$. Assume that a sequence $\tau := \{t_k\}_{k=1}^\infty$ satisfies the
following condition. For any $\theta \in (0, \theta_0]$ we have

$$\sum_{m=1}^\infty t_m \xi_m(\rho,\tau,\theta) = \infty. \tag{4.4}$$

Then, for the WGAFR(co) we have

$$\lim_{m\to\infty} E(G_m) = \inf_{x\in D} E(x).$$
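To make condition (4.4) concrete, the following minimal sketch computes $\xi_m(\rho,\tau,\theta)$ numerically. It assumes that $\xi_m(\rho,\tau,\theta)$ denotes a positive root $u$ of the equation $\rho(E,u) = \theta t_m u$, and it takes the power-type modulus $\rho(E,u) = \gamma u^q$ as an illustrative case. The function names `xi_root` and `xi_power` and all numerical values are hypothetical choices made only for this illustration.

```python
def xi_root(rho, theta, t, hi=1.0, iters=200):
    """Bisection for the positive root u of rho(u) = theta * t * u on (0, hi]."""
    # s(u) = rho(u)/u - theta*t is non-decreasing for convex rho with rho(0) = 0
    # and is negative near 0 because rho(u)/u -> 0 as u -> 0.
    def s(u):
        return rho(u) / u - theta * t

    if s(hi) < 0:
        raise ValueError("no root in (0, hi]; increase hi")
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if s(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


def xi_power(gamma, q, theta, t):
    """Closed-form root for the power-type modulus rho(u) = gamma * u**q."""
    return (theta * t / gamma) ** (1.0 / (q - 1.0))


# Illustrative values (assumptions, not taken from the paper).
gamma, q, theta, t = 0.5, 1.5, 0.1, 0.5
numeric = xi_root(lambda u: gamma * u ** q, theta, t)
closed = xi_power(gamma, q, theta, t)
```

For the power-type modulus the root equals $(\theta t_m/\gamma)^{1/(q-1)}$, so $t_m\xi_m$ is a constant multiple of $t_m^p$ with $p = q/(q-1)$, and in this case condition (4.4) amounts to $\sum_m t_m^p = \infty$.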

Proof. By Remark 4.1, $\{E(G_m)\}$ is a non-increasing sequence. Therefore we
have

$$\lim_{m\to\infty} E(G_m) = a.$$

Denote

$$b := \inf_{x\in D} E(x), \qquad \alpha := a - b.$$

We prove that $\alpha = 0$ by contradiction. Assume to the contrary that $\alpha > 0$.
Then, for any $m$ we have

$$E(G_m) - b \ge \alpha.$$
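We set $\varepsilon = \alpha/2$ and find $f^\varepsilon$ such that

$$E(f^\varepsilon) \le b + \varepsilon \quad\text{and}\quad f^\varepsilon/A(\varepsilon) \in A_1(\mathcal{D}),$$

with some $A(\varepsilon) \ge 1$. Then, by Lemma 4.1 we get

$$E(G_m) - E(f^\varepsilon) \le E(G_{m-1}) - E(f^\varepsilon) + \inf_{\lambda\ge 0}\left(-\lambda t_m A(\varepsilon)^{-1}\alpha/2 + 2\rho(E, C_0\lambda)\right).$$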

Let us specify $\theta := \min\left(\theta_0, \frac{\alpha}{8C_0A(\varepsilon)}\right)$ and take $\lambda = C_0^{-1}\xi_m(\rho,\tau,\theta)$. Then we
obtain

$$E(G_m) \le E(G_{m-1}) - 2\theta t_m \xi_m.$$

The assumption

$$\sum_{m=1}^\infty t_m \xi_m = \infty$$

brings a contradiction, which proves the theorem. □
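The convergence statement of Theorem 4.1 and the monotonicity in Remark 4.1 can be observed numerically in a simple finite-dimensional case. The sketch below is a minimal illustration, not the general algorithm: it assumes $X = \mathbb{R}^3$ with the Euclidean norm, $E(x) := \|f - x\|^2$, a small symmetric dictionary of unit vectors, and a constant weakness sequence $t_m = t$. For this quadratic $E$ the free relaxation step, taken here as minimization of $E((1-w)G_{m-1} + \lambda\varphi_m)$ over all real $w$ and $\lambda$, reduces to a two-dimensional least squares problem. All function names and numerical values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def energy(f, x):
    # E(x) = ||f - x||^2 (squared Euclidean norm), an illustrative choice of E
    return sum((fi - xi) ** 2 for fi, xi in zip(f, x))


def neg_gradient(f, x):
    # -E'(x) = 2 (f - x) for the quadratic energy above
    return [2.0 * (fi - xi) for fi, xi in zip(f, x)]


def free_relaxation_step(f, G, phi):
    """Minimize E(a*G + lam*phi) over real a, lam (a plays the role of 1 - w)."""
    gg, pp, gp = dot(G, G), dot(phi, phi), dot(G, phi)
    fg, fp = dot(f, G), dot(f, phi)
    det = gg * pp - gp * gp
    if det > 1e-12 * max(1.0, gg * pp):
        # normal equations of the 2x2 least squares problem
        a = (fg * pp - fp * gp) / det
        lam = (fp * gg - fg * gp) / det
    else:
        # G is zero or parallel to phi: the problem is one-dimensional
        a, lam = 0.0, fp / pp
    return [a * gi + lam * pi for gi, pi in zip(G, phi)]


def wgafr_co(f, dictionary, t, steps):
    G = [0.0] * len(f)
    values = [energy(f, G)]
    for _ in range(steps):
        r = neg_gradient(f, G)
        best = max(dot(r, g) for g in dictionary)
        # weak greedy step: take the first phi with <-E'(G), phi> >= t * sup_g <-E'(G), g>
        phi = next(g for g in dictionary if dot(r, g) >= t * best)
        G = free_relaxation_step(f, G, phi)
        values.append(energy(f, G))
    return G, values


def build_dictionary():
    base = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
            (1.0, 1.0, 0.0), (0.0, 1.0, 1.0)]
    dictionary = []
    for v in base:
        n = math.sqrt(dot(v, v))
        unit = [c / n for c in v]
        dictionary.append(unit)
        dictionary.append([-c for c in unit])  # symmetric dictionary
    return dictionary


f = [0.3, -0.5, 0.2]
G, values = wgafr_co(f, build_dictionary(), t=0.5, steps=200)
```

Here `values` records $E(G_m)$ along the iteration. The weak selection rule takes the first dictionary element that meets the greedy condition, which need not be the maximizer.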

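Theorem 4.2. Let $E$ be a uniformly smooth convex function with modulus
of smoothness $\rho(E,u) \le \gamma u^q$, $1 < q \le 2$. Take a number $\varepsilon \ge 0$ and an
element $f^\varepsilon$ from $D$ such that

$$E(f^\varepsilon) \le \inf_{x\in D} E(x) + \varepsilon, \qquad f^\varepsilon/A(\varepsilon) \in A_1(\mathcal{D}),$$

with some number $A(\varepsilon) \ge 1$. Then we have ($p := q/(q-1)$)

$$E(G_m) - \inf_{x\in D} E(x) \le \max\left(2\varepsilon,\; C(q,\gamma)A(\varepsilon)^q\Bigl(C(E,q,\gamma) + \sum_{k=1}^m t_k^p\Bigr)^{1-q}\right). \tag{4.5}$$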



Proof. Denote

$$a_n := E(G_n) - E(f^\varepsilon).$$

By Lemma 4.1 we have

$$a_m \le a_{m-1} + \inf_{\lambda\ge 0}\left(-\frac{\lambda t_m a_{m-1}}{A(\varepsilon)} + 2\gamma(C_0\lambda)^q\right). \tag{4.6}$$

Choose $\lambda$ from the equation

$$\frac{\lambda t_m a_{m-1}}{A(\varepsilon)} = 4\gamma(C_0\lambda)^q.$$

The rest of the proof repeats the argument from the proof of Theorem 2.2. □
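The recurrence (4.6) can be iterated numerically to see the rate in (4.5). The sketch below treats (4.6) as an equality, with $\lambda$ chosen from the equation above, so that each step reads $a_m = a_{m-1} - \lambda t_m a_{m-1}/(2A(\varepsilon))$. The constants $A(\varepsilon)$, $\gamma$, $C_0$, the starting value $a_0$ and the function names are hypothetical choices for illustration only.

```python
def recurrence_step(a_prev, t, A, gamma, C0, q):
    """One step of (4.6) taken with equality at the chosen lambda."""
    # positive root of lam * t * a / A = 4 * gamma * (C0 * lam)**q
    lam = (t * a_prev / (4.0 * gamma * A * C0 ** q)) ** (1.0 / (q - 1.0))
    # at this lam the smoothness term equals half of the linear term
    return a_prev - lam * t * a_prev / A + 2.0 * gamma * (C0 * lam) ** q


def run_recurrence(a0, ts, A=1.0, gamma=1.0, C0=2.0, q=2.0):
    values = [a0]
    for t in ts:
        values.append(recurrence_step(values[-1], t, A, gamma, C0, q))
    return values


# Illustrative run with a constant weakness sequence (assumed values).
ts = [0.5] * 50
vals = run_recurrence(1.0, ts, A=1.0, gamma=1.0, C0=2.0, q=2.0)
```

For $q = 2$ (so $p = 2$) the step becomes $a_m = a_{m-1}(1 - c_m a_{m-1})$ with $c_m = t_m^2/(8\gamma A(\varepsilon)^2 C_0^2)$. Since $1/(1-x) \ge 1 + x$ for $0 \le x < 1$, this gives $1/a_m \ge 1/a_0 + \sum_{k\le m} c_k$, which has the same form as the bound (4.5).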

5 Comments 

We already mentioned in the Introduction that the technique used in this 
paper is a slight modification of the corresponding technique developed in 
approximation theory (see [14], [16] and the book [17]). We now discuss this
in more detail. We pointed out in the Introduction that at first glance
the settings of approximation and optimization problems are very different.
In the approximation problem an element $f \in X$ is given and our task
is to find a sparse approximation of it. In optimization theory an energy 
function E(x) is given and we should find an approximate sparse solution 
to the minimization problem. It turns out that the same technique can be 
used for solving both problems. In nonlinear approximation we use greedy 
algorithms, for instance WCGA and WGAFR, for solving this problem. The 
greedy step is the one where we look for $\varphi_m \in \mathcal{D}$ satisfying

$$F_{f_{m-1}}(\varphi_m) \ge t_m \sup_{g\in\mathcal{D}} F_{f_{m-1}}(g).$$

This step is based on the norming functional $F_{f_{m-1}}$. As we pointed out
in the Introduction the norming functional $F_{f_{m-1}}$ is the derivative of the
norm function $E(x) := \|x\|$. Clearly, we can reformulate our problem of
approximation of $f$ as an optimization problem with $E(x) := \|f - x\|$. It is a
convex function, however, it is not a uniformly smooth function in the sense
of smoothness of convex functions. A way out of this problem is to consider
$E(f, x, q) := \|f - x\|^q$ with appropriate $q$. For instance, it is known (see [1])
that if $\rho(u) \le \gamma u^q$, $1 < q \le 2$, then $E(f, x, q)$ is a uniformly smooth convex
function with modulus of smoothness of order $u^q$. Next,

$$E'(f,x,q) = -q\|f - x\|^{q-1}F_{f-x}.$$

Therefore, the algorithms WCGA(co), WRGA(co) and WGAFR(co) coincide
in this case with the corresponding algorithms WCGA, WRGA and WGAFR
from approximation theory. In the proofs of approximation theory results we
use inequality (1.2) and the trivial inequality

$$\|x + uy\| \ge F_x(x + uy) = \|x\| + uF_x(y). \tag{5.1}$$

In the proofs of optimization theory results we use Lemma 1.1 instead of
inequality (1.2) and the convexity inequality (1.9) instead of (5.1). The rest
of the proofs uses the same technique of solving the corresponding recurrent 
inequalities. 
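In the Euclidean case the formula $E'(f,x,q) = -q\|f-x\|^{q-1}F_{f-x}$ can be checked directly, since there the norming functional $F_y$ is represented by the unit vector $y/\|y\|$. The sketch below compares the formula with a central finite difference; the choice of $\mathbb{R}^3$, the test vectors, the value of $q$ and the function names are assumptions made for illustration.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def energy_q(f, x, q):
    # E(f, x, q) = ||f - x||^q with the Euclidean norm
    return norm([a - b for a, b in zip(f, x)]) ** q


def derivative_formula(f, x, q):
    d = [a - b for a, b in zip(f, x)]
    n = norm(d)
    # F_{f-x} is represented by the unit vector (f - x)/||f - x||
    return [-q * n ** (q - 1.0) * (c / n) for c in d]


def finite_difference(f, x, q, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((energy_q(f, xp, q) - energy_q(f, xm, q)) / (2.0 * h))
    return grad


# Illustrative data (assumed values), with x different from f.
f_vec = [1.0, 2.0, -1.0]
x_vec = [0.5, -0.3, 0.2]
exact = derivative_formula(f_vec, x_vec, 1.5)
approx = finite_difference(f_vec, x_vec, 1.5)
max_error = max(abs(a - b) for a, b in zip(exact, approx))
```

The comparison is only meaningful for $x \ne f$, where the norm is differentiable.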

Our smoothness assumption on $E$ was used in the proofs of all theorems
from Sections 2-4 in the form of Lemma 1.1. This means that in all those
theorems the assumption that $E$ has modulus of smoothness $\rho(E, u)$ can be
replaced by the assumption that $E$ satisfies the inequality

$$E(x + uy) - E(x) - u\langle E'(x), y\rangle \le 2\rho(E, u\|y\|), \qquad x \in D. \tag{5.2}$$

Moreover, in Section 3, where we consider the WRGA(co), the approximants
$G_m$ are forced to stay in $A_1(\mathcal{D})$. Therefore, in Theorems 3.1 and 3.2 we
can use the following inequality instead of (5.2)

$$E(x + u(y - x)) - E(x) - u\langle E'(x), y - x\rangle \le 2\rho(E, u\|y - x\|), \tag{5.3}$$

for $x, y \in A_1(\mathcal{D})$ and $u \in [0, 1]$.

We note that smoothness assumptions in the form of (5.3) with $\rho(E, u\|y - x\|)$
replaced by $C\|y - x\|^q$ were used in [20]. The authors studied the version
of WRGA(co) with weakness sequence $t_k = 1$, $k = 1, 2, \dots$. They proved
Theorem 3.2 in this case. Their proof, like our proof in Section 3, is very
close to the corresponding proof from greedy approximation (see [14], [16]
Section 3.3 or [17] Section 6.3).

We now make some general remarks on the results of this paper. As we
already pointed out in the Introduction, a typical problem of convex optimization
is to find an approximate solution to the problem

$$w := \inf_{x} E(x). \tag{5.4}$$

In this paper we are interested in sparse (with respect to a given dictionary
$\mathcal{D}$) solutions of (5.4). This means that we are solving the following problem
instead of (5.4). For a given dictionary $\mathcal{D}$ consider the set of all $m$-term
polynomials with respect to $\mathcal{D}$:

$$\Sigma_m(\mathcal{D}) := \left\{x \in X : x = \sum_{i=1}^m c_i g_i, \; g_i \in \mathcal{D}\right\}.$$

We solve the following sparse optimization problem

$$w_m := \inf_{x\in\Sigma_m(\mathcal{D})} E(x). \tag{5.5}$$

In this paper we have used greedy-type algorithms to solve (approximately)
problem (5.5). The results of the paper show that greedy-type
algorithms with respect to $\mathcal{D}$ solve problem (5.4) too.

We are interested in a solution from $\Sigma_m(\mathcal{D})$. Clearly, when we optimize
a linear form $\langle F, g\rangle$ over the dictionary $\mathcal{D}$ we obtain the same value as
optimization over the convex hull $A_1(\mathcal{D})$. We often use this property (see Lemma
2.2). However, at the greedy step of our algorithms we choose

(1) $\varphi_m := \varphi_m^{c,\tau} \in \mathcal{D}$ is any element satisfying

$$\langle -E'(G_{m-1}), \varphi_m\rangle \ge t_m \sup_{g\in\mathcal{D}} \langle -E'(G_{m-1}), g\rangle.$$

Thus if we replace the dictionary $\mathcal{D}$ by its convex hull $A_1(\mathcal{D})$ we may take
an element satisfying the above greedy condition which is not from $\mathcal{D}$ and
could be even an infinite combination of the dictionary elements.
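The remark that a linear form attains the same supremum over $\mathcal{D}$ and over $A_1(\mathcal{D})$ can be illustrated numerically. The sketch below assumes a finite symmetric dictionary in $\mathbb{R}^3$ and samples random convex combinations of its elements; the dictionary, the vector representing $F$ and the function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_convex_combination(dictionary, rng):
    # nonnegative weights normalized to sum to one
    weights = [rng.random() for _ in dictionary]
    total = sum(weights)
    dim = len(dictionary[0])
    return [sum(w / total * g[i] for w, g in zip(weights, dictionary))
            for i in range(dim)]


def check_linear_form(F, dictionary, trials=1000, seed=0):
    rng = random.Random(seed)
    sup_dict = max(dot(F, g) for g in dictionary)
    sup_hull = max(dot(F, random_convex_combination(dictionary, rng))
                   for _ in range(trials))
    return sup_dict, sup_hull


# Symmetric dictionary of signed coordinate vectors (an assumed example).
dictionary = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
              [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
sup_dict, sup_hull = check_linear_form([0.6, -1.0, 0.4], dictionary)
```

No sampled convex combination exceeds the maximum over the dictionary. A convex combination may nevertheless satisfy the weak greedy condition with $t_m < 1$ without belonging to $\mathcal{D}$, which is the point of the paragraph above.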

Next, we begin with a Banach space $X$ and a convex function $E(x)$ defined
on this space. Properties of this function $E$ are formulated in terms of Banach
space $X$. If instead of Banach space $X$ we consider another Banach space, for
instance, the one generated by $A_1(\mathcal{D})$ as a unit ball then the properties of $E$
will change. For instance, a typical example of $E$ could be $E(x) := \|f - x\|^q$
with $\|\cdot\|$ being the norm of Banach space $X$. Then our assumption that the
set $D := \{x : E(x) \le E(0)\}$ is bounded is satisfied. However, this set is not
necessarily bounded in the norm generated by $A_1(\mathcal{D})$.